Abstract

Cross-validation (CV) is one of the main tools for performance estimation and parameter tuning in machine learning. The general recipe for computing a CV estimate is to run a learning algorithm separately for each CV fold, a computationally expensive process. In this paper, we propose a new approach to reduce the computational burden of CV-based performance estimation. As opposed to all previous attempts, which are specific to a particular learning model or problem domain, we propose a general method applicable to a large class of incremental learning algorithms, which are uniquely fitted to big data problems. In particular, our method applies to a wide range of supervised and unsupervised learning tasks with different performance criteria, as long as the base learning algorithm is incremental. We show that the running time of the algorithm scales logarithmically, rather than linearly, in the number of CV folds. Furthermore, the algorithm has favorable properties for parallel and distributed implementation. Experiments with state-of-the-art incremental learning algorithms confirm the practicality of the proposed method.


1 Introduction 

Estimating generalization performance is a core task in machine learning. Often, such an estimate is computed using k-fold cross-validation (k-CV): the dataset is partitioned into k subsets of approximately equal size, and each subset is used to evaluate a model trained on the k - 1 other subsets to produce a numerical score; the k-CV performance estimate is then obtained as the average of the obtained scores.

A significant drawback of k-CV is its heavy computational cost. The standard method for computing a k-CV estimate is to train k separate models independently, one for each fold, requiring (roughly) k times the work of training a single model. The extra computational cost imposed by k-CV is especially high for leave-one-out CV (LOOCV), a popular variant, where the number of folds equals the number of samples in the dataset. The increased computational requirements may become a major problem, especially when CV is used for tuning hyper-parameters of learning algorithms in a grid search, in which case one k-CV session needs to be run for every combination of hyper-parameters, dramatically increasing the computational cost even when the number of hyper-parameters is small.^1

To avoid the added cost, much previous research went into studying the efficient calculation of the CV estimate (exact or approximate). However, previous work has been concerned with special models and problems: with the exception of Izbicki [2013], these methods are typically limited to linear prediction with the squared loss and to kernel methods with various loss functions, including twice-differentiable losses and the hinge loss (see Section 1.1 for details). In these works, the training time of the underlying learning algorithm is O(n^), where n is the size of the dataset, and the main result states that the CV estimate (including LOOCV estimates) is yet computable in O(n^) time. Finally, Izbicki [2013] gives a very efficient solution (with O(n + k) computational complexity) for the restrictive case when two models trained on any two datasets can be combined, in constant time, into a single model that is trained on the union of the datasets.

Although these results are appealing, they are limited to methods and problems with specific features. In particular, they are unsuitable for big data problems where the only practical methods are incremental and run in linear, or even sub-linear time [Shalev-Shwartz et al., 2011; Clarkson et al., 2012]. In this paper, we show that CV calculation can be done efficiently for incremental learning algorithms. In Section 3, we present a method that, under mild, natural conditions, speeds up the calculation of the k-CV estimate for incremental learning algorithms, in the general learning setting explained in Section 2 (covering a wide range of supervised and unsupervised learning problems), and for arbitrary performance measures. The proposed method, TreeCV, exploits the fact that incremental learning algorithms do not need to be fed with the whole dataset at once, but instead learn from whatever data they are provided with and later update their models when more data arrives, without the need to be trained on the whole dataset from scratch. As we will show in Section 3.1, TreeCV computes a guaranteed-precision approximation of the CV estimate when the algorithms produce stable models. We present several implementation details and analyze the time and space complexity of TreeCV in Section 4. In particular, we show that its computation time is only O(log k) times bigger than the time required to train a single model, which is a major improvement compared to the k-times increase required for a naive computation of the CV estimate. Finally, Section 5 presents experimental results, which confirm the efficiency of the proposed algorithm.

^1 For example, the semi-supervised anomaly detection method of Görnitz et al. [2013] has four hyper-parameters to tune. Thus, testing all possible combinations for, e.g., 10 possible values of each hyper-parameter requires running CV 10000 times.


1.1 Related Work 

Various methods, often specialized to specific learning settings, have been proposed to speed up the computation of the k-CV estimate. Most frequently, efficient k-CV computation methods are specialized to the regularized least-squares (RLS) learning settings (with squared-RKHS-norm regularization). In particular, the generalized cross-validation method [Golub et al., 1979; Wahba, 1990] computes the LOOCV estimate in O(n^) time for a dataset of size n from the solution of the RLS problem over the whole dataset; this is generalized to k-CV calculation in O(n^/k) time by Pahikkala et al. [2006]. In the special case of least-squares support vector machines (LSSVMs), Cawley [2006] shows that LOOCV can be computed in O(n) time using a Cholesky factorization (again, after obtaining the solution of the RLS problem). It should be noted that all of the aforementioned methods use the inverse (or some factorization) of a special matrix (called the influence matrix) in their calculation; the aforementioned running times are therefore based on the assumption that this inverse is available (usually as a by-product of solving the RLS problem, computed in Ω(n^) time).^2

A related idea for approximating the LOOCV estimate is using the notion of influence functions, which measure the effect of adding an infinitesimal single point of probability mass to a distribution. Using this notion, Debruyne et al. [2008] propose to approximate the LOOCV estimate for kernel-based regression algorithms that use any twice-differentiable loss function. Liu et al. [2014] use Bouligand influence functions [Christmann and Messem, 2008], a generalized notion of influence functions for arbitrary distributions, in order to calculate the k-CV estimate for kernel methods and twice-differentiable loss functions. Again, these methods need an existing model trained on the whole dataset, and require Ω(n^) running time.

A notable exception to the square-loss/differentiable-loss requirement is the work of Cauwenberghs and Poggio [2001]. They propose an incremental training method for support-vector classification (with the hinge loss), and show how to revert the incremental algorithm to “unlearn” data points and obtain the LOOCV estimate. The LOOCV estimate is obtained in time similar to that of a single training by the same incremental algorithm, which is Ω(n^) in the worst case.

Closest to our approach is the recent work of Izbicki [2013]: assuming that two models trained on any two separate datasets can be combined, in constant time, into a single model that is exactly the same as if the model was trained on the union of the datasets, Izbicki [2013] can compute the k-CV estimate in O(n + k) time. However, his assumption is very restrictive and applies only to simple methods, such as Bayesian classification.^3 In contrast, roughly, we only assume that a model can be updated efficiently with new data (as opposed to combining the existing model and a model trained on the new data in constant time), and we only require that models trained with permutations of the data be sufficiently similar, not exactly the same.

^2 In the absence of this assumption, stochastic trace estimators [Girard, 1989] or numerical approximation techniques [Golub and von Matt, 1997; Nguyen et al., 2001] are used to avoid the costly inversion of the matrix.

Note that the CV estimate depends on the specific partitioning of the data on which it is calculated. To reduce the variance due to different partitionings, the k-CV score can be averaged over multiple random partitionings. For LSSVMs, An et al. [2007] propose a method to efficiently compute the CV score for multiple partitionings, resulting in a total running time of O(L(n - b)^), where L is the number of different partitionings and b is the number of data points in each test set. In the case when all possible partitionings of the dataset are used, the complete CV (CCV) score is obtained. Mullin and Sukthankar [2000] study efficient computation of CCV for nearest-neighbor-based methods; their method runs in time O(n^k + n^ log(n)).


2 Problem Definition 

We consider a general setting that encompasses a wide range of supervised and unsupervised learning scenarios (see Table 1 for a few examples). In this setting, we are given a dataset $\{z_1, z_2, \ldots, z_n\}$,^4 where each data point $z_i = (x_i, y_i)$ consists of an input $x_i \in \mathcal{X}$ and an outcome $y_i \in \mathcal{Y}$, for some given sets $\mathcal{X}$ and $\mathcal{Y}$. For example, we might have $\mathcal{X} \subset \mathbb{R}^d$, $d \ge 1$, with $\mathcal{Y} = \{+1, -1\}$ in binary classification and $\mathcal{Y} \subset \mathbb{R}$ in regression; for unsupervised learning, $\mathcal{Y}$ is a singleton: $\mathcal{Y} = \{\text{NoLabel}\}$. We define a model as a function^5 $f : \mathcal{X} \to \mathcal{P}$ that, given an input $x \in \mathcal{X}$, makes a prediction, $f(x) \in \mathcal{P}$, where $\mathcal{P}$ is a given set (for example, $\mathcal{P} = \{+1, -1\}$ in binary classification: the model predicts which class the given input belongs to). Note that the prediction set need not be the same as the outcome set, particularly for unsupervised learning tasks. The quality of a prediction is assessed by a performance measure (or loss function) $\ell : \mathcal{P} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that assigns a scalar value $\ell(p, x, y)$ to the prediction $p$ for the pair $(x, y)$; for example, $\ell(p, x, y) = \mathbb{I}\{p \ne y\}$ for the prediction error (misclassification rate) in binary classification (where $\mathbb{I}\{E\}$ denotes the indicator function of an event $E$).

Next, we define the notion of an incremental learning algorithm. Informally, an incremental learning algorithm is a procedure that, given a model learned from previous data points and a new dataset, updates the model to accommodate the new dataset at a fraction of the cost of training the model on the whole data from scratch. Formally, let $\mathcal{M} \subset \{f : \mathcal{X} \to \mathcal{P}\}$ be a set of models, and define $\mathcal{Z}^*$ to be the set of all possible datasets of all possible sizes. Disregarding computation for now, an incremental learning algorithm is a mapping $\mathcal{L} : (\mathcal{M} \cup \{\emptyset\}) \times \mathcal{Z}^* \to \mathcal{M}$ that, given a model $f$ from $\mathcal{M}$ (or $\emptyset$ when a model does not exist yet) and a dataset $Z' = (z'_1, \ldots, z'_m)$, returns an “updated” model $f' = \mathcal{L}(f, Z')$. To capture often needed internal states (e.g., to store learning rates), we allow the “padding” of the models in $\mathcal{M}$ with extra information as necessary, while still viewing the models as $\mathcal{X} \to \mathcal{P}$ maps when convenient. Above, $f$ is usually the result of a previous invocation of $\mathcal{L}$ on another dataset $Z \in \mathcal{Z}^*$. In particular, $\mathcal{L}(\emptyset, Z)$ learns a model from scratch using the dataset $Z$. An important class of incremental algorithms are online algorithms, which update the model one data point at a time; to update $f$ with $Z'$, these algorithms make $m$ consecutive calls to $\mathcal{L}$, where each call updates the latest model with the next remaining data point according to a random ordering of the points in $Z'$.

^3 The other methods considered by Izbicki [2013] do not satisfy the theoretical assumptions of that paper.

^4 Formally, we assume that this is a multi-set, so there might be multiple copies of the same data point.

^5 Without loss of generality, we only consider deterministic models: we may embed any randomness required to make a prediction into the value of $x$, so that $f(x)$ is a deterministic mapping from $\mathcal{X}$ to $\mathcal{P}$.

Table 1: Instances of the general learning problem considered in the paper. In K-means clustering, $c_j$ denotes the center of the $j$th cluster.

In the rest of this paper, we consider an incremental learning algorithm $\mathcal{L}$, and a fixed, given partitioning of the dataset $Z = \{z_1, z_2, \ldots, z_n\}$ into $k$ subsets (“chunks”) $Z_1, Z_2, \ldots, Z_k$. We use $f_i = \mathcal{L}(\emptyset, Z \setminus Z_i)$ to denote the model learned from all the chunks except $Z_i$. Thus, the k-CV estimate of the generalization performance of $\mathcal{L}$, denoted $R_{k\text{-CV}}$, is given by

$$R_{k\text{-CV}} = \frac{1}{k} \sum_{i=1}^{k} R_i,$$

where $R_i = \frac{1}{|Z_i|} \sum_{(x, y) \in Z_i} \ell(f_i(x), x, y)$, $i = 1, 2, \ldots, k$, is the performance of the model $f_i$ evaluated on $Z_i$. The LOOCV estimate $R_{n\text{-CV}}$ is obtained when $k = n$.
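The definition above can be illustrated with a minimal sketch of the naive k-CV computation, which trains one model from scratch per fold. The toy learner (a running mean of the targets), the squared loss, and all function names here are assumptions made for illustration; they do not come from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]   # (x, y); this toy learner ignores x
Model = Tuple[float, int]     # (sum of targets seen, number of points seen)


def update(model: Optional[Model], data: Sequence[Point]) -> Model:
    """Toy incremental learner L(f, Z'); None stands for the empty model."""
    total, count = (0.0, 0) if model is None else model
    for _, y in data:
        total, count = total + y, count + 1
    return (total, count)


def squared_loss(model: Model, point: Point) -> float:
    """Loss of predicting the running mean for the target of `point`."""
    total, count = model
    return (total / count - point[1]) ** 2


def naive_kcv(
    chunks: List[List[Point]],
    update: Callable[[Optional[Model], Sequence[Point]], Model],
    loss: Callable[[Model, Point], float],
) -> float:
    """Average of the per-fold scores R_i, training each f_i from scratch."""
    k = len(chunks)
    total = 0.0
    for i in range(k):
        # f_i = L(empty, Z \ Z_i): train on every chunk except the i-th.
        training = [z for j, chunk in enumerate(chunks) if j != i for z in chunk]
        f_i = update(None, training)
        # R_i: mean loss of f_i on the held-out chunk Z_i.
        total += sum(loss(f_i, z) for z in chunks[i]) / len(chunks[i])
    return total / k


chunks = [[(0.0, 1.0)], [(0.0, 2.0)], [(0.0, 3.0)]]
print(naive_kcv(chunks, update, squared_loss))  # LOOCV on three points: 1.5
```

Each fold feeds roughly n(k - 1)/k points to the learner, so the total training work grows linearly with k; this is the cost that the recursive scheme of the next section targets.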

3 Recursive Cross-Validation 

Our algorithm builds on the observation that for every $i$ and $j$, $1 \le i < j \le k$, the training sets $Z \setminus Z_i$ and $Z \setminus Z_j$ are almost identical, except for the two chunks $Z_i$ and $Z_j$ that are held out for testing from one set but not the other. The naive k-CV calculation method ignores this fact, potentially wasting computational resources. When using an incremental learning algorithm, we may be able to exploit this redundancy: we can first learn a model only from the examples shared between the two training sets, and then “increment” the differences into two different copies of the model learned. When the extra cost of saving and restoring a model required by this approach is comparable to learning a model from scratch, then this approach may result in a considerable speedup.


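Algorithm 1 TreeCV$(s, e, \hat f_{s..e})$

input: indices $s$ and $e$, and the model $\hat f_{s..e}$ trained so far.

if $e = s$ then
    $\hat R_s \leftarrow \frac{1}{|Z_s|} \sum_{(x, y) \in Z_s} \ell(\hat f_{s..e}(x), x, y)$.
    return $\frac{1}{k} \hat R_s$.
else
    Let $m \leftarrow \lfloor (s + e)/2 \rfloor$.
    Update the model with the chunks $Z_{m+1}, \ldots, Z_e$ to get $\hat f_{s..m} = \mathcal{L}(\hat f_{s..e}, Z_{m+1}, \ldots, Z_e)$.
    Let $r \leftarrow$ TreeCV$(s, m, \hat f_{s..m})$.
    Update the model with the chunks $Z_s, \ldots, Z_m$ to get $\hat f_{m+1..e} = \mathcal{L}(\hat f_{s..e}, Z_s, \ldots, Z_m)$.
    Let $r \leftarrow r +$ TreeCV$(m + 1, e, \hat f_{m+1..e})$.
    return $r$.
end if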


To exploit the aforementioned redundancy in training all $k$ models at the same time, we organize the k-CV computation process in a tree structure. The resulting recursive procedure, TreeCV$(s, e, \hat f_{s..e})$, shown in Algorithm 1, receives two indices $s$ and $e$, $1 \le s \le e \le k$, and a model $\hat f_{s..e}$ that is trained on all chunks except $Z_s, Z_{s+1}, \ldots, Z_e$, and returns $(1/k) \sum_{i=s}^{e} \hat R_i$, the normalized sum of the performance scores $\hat R_i$, $i = s, \ldots, e$, corresponding to testing the model trained on $Z \setminus Z_i$ on the chunk $Z_i$, for $i = s, \ldots, e$. TreeCV divides the hold-out chunks into two groups $Z_s, Z_{s+1}, \ldots, Z_m$ and $Z_{m+1}, \ldots, Z_e$, where $m = \lfloor (s + e)/2 \rfloor$ is the mid-point, and obtains the test performance scores for the two groups separately by recursively calling itself. More precisely, TreeCV first updates the model by training it on the second group of chunks, $Z_{m+1}, \ldots, Z_e$, resulting in the model $\hat f_{s..m}$, and makes a recursive call TreeCV$(s, m, \hat f_{s..m})$ to get $(1/k) \sum_{i=s}^{m} \hat R_i$. Then, it repeats the same procedure for the other group of chunks: starting from the original model $\hat f_{s..e}$ it had received, it updates the model, this time using the first group of the remaining chunks, $Z_s, \ldots, Z_m$, that were previously held out, and calls TreeCV$(m + 1, e, \hat f_{m+1..e})$ to get $(1/k) \sum_{i=m+1}^{e} \hat R_i$ (for the second group of chunks). The recursion stops when there is only one hold-out chunk ($s = e$), in which case the performance score $\hat R_s$ of the model $\hat f_{s..s}$ (which is now trained on all the chunks except for $Z_s$) is directly calculated and returned. Calling TreeCV$(1, k, \emptyset)$ calculates $\hat R_{k\text{-CV}} = (1/k) \sum_{i=1}^{k} \hat R_i$. Figure 1 shows an example of the recursive call tree underlying a run of the algorithm calculating the LOOCV estimate on a dataset of four data points. Note that the tree structure imposes a new order of feeding the chunks to the learning algorithm, e.g., $Z_3$ and $Z_4$ are learned before $Z_2$ in the first branch of the tree.
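The recursion described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it uses 0-based chunk indices, assumes that `update` returns a new model without modifying its argument (so the model received by a call stays available for the second branch), and reuses a hypothetical running-mean learner with squared loss.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]   # (x, y); the toy learner ignores x
Model = Tuple[float, int]     # (sum of targets seen, number of points seen)


def update(model: Optional[Model], data: Sequence[Point]) -> Model:
    """Toy incremental learner; returns a new model, None is the empty model."""
    total, count = (0.0, 0) if model is None else model
    for _, y in data:
        total, count = total + y, count + 1
    return (total, count)


def squared_loss(model: Model, point: Point) -> float:
    total, count = model
    return (total / count - point[1]) ** 2


def tree_cv(
    chunks: List[List[Point]],
    update: Callable[[Optional[Model], Sequence[Point]], Model],
    loss: Callable[[Model, Point], float],
) -> float:
    """k-CV estimate computed with the tree-structured recursion."""
    k = len(chunks)
    if k < 2:
        raise ValueError("need at least two chunks")

    def recurse(s: int, e: int, model: Optional[Model]) -> float:
        # Invariant: `model` was trained on every chunk except chunks[s..e].
        if s == e:
            held_out = chunks[s]
            score = sum(loss(model, z) for z in held_out) / len(held_out)
            return score / k
        m = (s + e) // 2
        # Add the second group, then evaluate the first group recursively.
        left_model = model
        for j in range(m + 1, e + 1):
            left_model = update(left_model, chunks[j])
        r = recurse(s, m, left_model)
        # Restart from the received model and add the first group instead.
        right_model = model
        for j in range(s, m + 1):
            right_model = update(right_model, chunks[j])
        return r + recurse(m + 1, e, right_model)

    return recurse(0, k - 1, None)


chunks = [[(0.0, 1.0)], [(0.0, 2.0)], [(0.0, 3.0)]]
print(tree_cv(chunks, update, squared_loss))  # 1.5
```

Because the running-mean learner does not depend on the order in which points arrive, the value returned here coincides with the naive k-CV estimate; for order-sensitive learners the two can differ, which is the subject of Section 3.1.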

3.1 Accuracy of TreeCV 

To simplify the analysis, in this section and the next, we assume that each chunk is of the same size, that is $n = kb$ for some integer $b \ge 1$.

Figure 1: An example run of TreeCV on a dataset of size four, calculating the LOOCV estimate.

Note that the models $\hat f_{s..s}$ used in computing $\hat R_s$ are learned incrementally. If the learning algorithm learns the same model no matter whether it is given the chunks all at once or gradually, then $\hat f_{s..s}$ is the same as the model $f_s$ used in the definition of $R_{k\text{-CV}}$, and $\hat R_{k\text{-CV}} = R_{k\text{-CV}}$. If this assumption does not hold, then $\hat R_{k\text{-CV}}$ is still close to $R_{k\text{-CV}}$ as long as the models $\hat f_{s..s}$ are sufficiently similar to their corresponding models $f_s$. In the rest of this section, we formalize this assertion.

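First, we define the notion of stability for an incremental learning algorithm. Intuitively, an incremental learning algorithm is stable if the performance of the models is nearly the same no matter whether they are learned incrementally or in batch. Formally, suppose that a dataset $\{z_1, \ldots, z_n\}$ is partitioned into $l + 1$ nonempty chunks $Z^{\text{test}}$ and $Z^{\text{train}}_1, \ldots, Z^{\text{train}}_l$, and we are using $Z^{\text{test}}$ as the test data and the chunks $Z^{\text{train}}_1, \ldots, Z^{\text{train}}_l$ as the training data. Let

$$f^{\text{batch}} = \mathcal{L}(\emptyset, Z^{\text{train}}_1 \cup \cdots \cup Z^{\text{train}}_l)$$

denote the model learned from the training data when provided all at the same time, and let

$$f^{\text{inc}} = \mathcal{L}(\mathcal{L}(\cdots \mathcal{L}(\mathcal{L}(\emptyset, Z^{\text{train}}_1), Z^{\text{train}}_2) \cdots, Z^{\text{train}}_{l-1}), Z^{\text{train}}_l)$$

denote the model learned from the same chunks when they are provided incrementally to $\mathcal{L}$. Let $R^{\text{test}}(f) = \frac{1}{|Z^{\text{test}}|} \sum_{(x, y) \in Z^{\text{test}}} \ell(f(x), x, y)$ denote the performance of a model $f$ on the test data $Z^{\text{test}}$.

Definition 1 (Incremental stability). The algorithm $\mathcal{L}$ is $g$-incrementally stable for a function $g : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ if, for any dataset $\{z_1, z_2, \ldots, z_n\}$, $b < n$, and partition $Z^{\text{test}}, Z^{\text{train}}_1, \ldots, Z^{\text{train}}_l$ with nonempty cells $Z^{\text{train}}_i$, $1 \le i \le l$, and $|Z^{\text{test}}| = b$, the test performance of the models $f^{\text{inc}}$ and $f^{\text{batch}}$ defined above satisfy

$$\left| R^{\text{test}}(f^{\text{inc}}) - R^{\text{test}}(f^{\text{batch}}) \right| \le g(n - b, b).$$

If the data $\{z_1, \ldots, z_n\}$ is drawn independently from the same distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ and/or the learning algorithm $\mathcal{L}$ is randomized, we say that $\mathcal{L}$ is $g$-incrementally stable in expectation if

$$\left| \mathbb{E}\left[R^{\text{test}}(f^{\text{inc}})\right] - \mathbb{E}\left[R^{\text{test}}(f^{\text{batch}})\right] \right| \le g(n - b, b)$$

for all partitions selected independently of the data and the randomization of $\mathcal{L}$.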

The following statement is an immediate consequence of the above definition:

Theorem 1. Suppose $n = bk$ for some integer $b \ge 1$ and that algorithm $\mathcal{L}$ is $g$-incrementally stable. Then,

$$\left| \hat R_{k\text{-CV}} - R_{k\text{-CV}} \right| \le g(n - b, b).$$

If $\mathcal{L}$ is $g$-incrementally stable in expectation then

$$\left| \mathbb{E}\left[\hat R_{k\text{-CV}}\right] - \mathbb{E}\left[R_{k\text{-CV}}\right] \right| \le g(n - b, b).$$

Proof. We prove the first statement only; the proof of the second part is essentially identical. Recall that $Z_j$, $j = 1, 2, \ldots, k$, denote the chunks used for cross-validation. Fix $i$ and let $l = \lceil \log k \rceil$. Let $Z^{\text{test}} = Z_i$ and let $Z^{\text{train}}_j$, $j = 1, \ldots, l$, denote the union of the chunks used for training at depth $j$ of the recursion branch ending with the computation of $\hat R_i$. Then, by definition, $\hat R_i = R^{\text{test}}(f^{\text{inc}})$ and $R_i = R^{\text{test}}(f^{\text{batch}})$. Therefore, $|\hat R_i - R_i| \le g(n - b, b)$, and the statement follows since $\hat R_{k\text{-CV}}$ and $R_{k\text{-CV}}$ are defined as the averages of the $\hat R_i$ and $R_i$, respectively. □


It is then easy to see that incremental learning methods 
with a bound on their excess risk are incrementally stable in 
expectation. 

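Theorem 2. Suppose the data $\{z_1, \ldots, z_n\}$ is drawn independently from the same distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be drawn from $\mathcal{D}$ independently of the data and let $f^* \in \arg\min_{f \in \mathcal{M}} \mathbb{E}[\ell(f(X), X, Y)]$ denote a model in $\mathcal{M}$ with minimum expected loss. Assume there exist upper bounds $m^{\text{inc}}(n - b)$ and $m^{\text{batch}}(n - b)$ on the excess risks of $f^{\text{inc}}$ and $f^{\text{batch}}$ trained on $n' = n - b$ data points, such that

$$\mathbb{E}\left[\ell(f^{\text{batch}}(X), X, Y) - \ell(f^*(X), X, Y)\right] \le m^{\text{batch}}(n')$$

and

$$\mathbb{E}\left[\ell(f^{\text{inc}}(X), X, Y) - \ell(f^*(X), X, Y)\right] \le m^{\text{inc}}(n')$$

for all $n$ and for every partitioning of the dataset that is independent of the data, $(X, Y)$, and the randomization of $\mathcal{L}$. Then $\mathcal{L}$ is incrementally stable in expectation w.r.t. the loss function $\ell$, with $g(n', b) = \max\{m^{\text{batch}}(n'), m^{\text{inc}}(n')\}$.

Proof. Since the data points in the sets $Z^{\text{train}}_1, \ldots, Z^{\text{train}}_l$ and $Z^{\text{test}}$ are independent, $f^{\text{batch}}$ and $f^{\text{inc}}$ are independent of $Z^{\text{test}}$. Hence, $\mathbb{E}[R^{\text{test}}(f^{\text{batch}})] = \mathbb{E}[\ell(f^{\text{batch}}(X), X, Y)]$ and $\mathbb{E}[R^{\text{test}}(f^{\text{inc}})] = \mathbb{E}[\ell(f^{\text{inc}}(X), X, Y)]$. Therefore,

$$\begin{aligned}
\mathbb{E}[R^{\text{test}}(f^{\text{inc}})] - \mathbb{E}[R^{\text{test}}(f^{\text{batch}})]
&= \mathbb{E}[R^{\text{test}}(f^{\text{inc}})] - \mathbb{E}[\ell(f^*(X), X, Y)] + \mathbb{E}[\ell(f^*(X), X, Y)] - \mathbb{E}[R^{\text{test}}(f^{\text{batch}})] \\
&\le \mathbb{E}[R^{\text{test}}(f^{\text{inc}})] - \mathbb{E}[\ell(f^*(X), X, Y)] \le m^{\text{inc}}(n'),
\end{aligned}$$

where we used the optimality of $f^*$. Similarly, $\mathbb{E}[R^{\text{test}}(f^{\text{batch}})] - \mathbb{E}[R^{\text{test}}(f^{\text{inc}})] \le m^{\text{batch}}(n')$, finishing the proof. □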

In particular, for online learning algorithms satisfying some regret bound, standard online-to-batch conversion results [Cesa-Bianchi et al., 2004; Kakade and Tewari, 2009] yield excess-risk bounds for independent and identically distributed data. Similarly, excess-risk bounds are often available for stochastic gradient descent (SGD) algorithms which scan the data once (see, e.g., [Nemirovski et al., 2009]). For online learning algorithms (including single-pass SGD), the batch version is usually defined by running the algorithm using a random ordering of the data points or sampling from the data points with replacement. Typically, this version also satisfies the same excess-risk bounds. Thus, the previous theorem shows that these algorithms are incrementally stable with $g(n, b)$ being their excess-risk bound for $n$ samples.

Note that this incremental stability is w.r.t. the loss function whose excess risk is bounded. For example, after visiting $n$ data points, the regret of PEGASOS [Shalev-Shwartz et al., 2011] with bounded features is bounded by $O(\log(n))$. Using the online-to-batch conversion of Kakade and Tewari [2009], this gives an excess-risk bound $m(n) = O(\log(n)/n)$, and hence PEGASOS is stable w.r.t. the regularized hinge loss with $g(n, b) = m(n) = O(\log(n)/n)$. Similarly, SGD over a compact set with bounded features and a bounded convex loss is stable w.r.t. that convex loss with $g(n, b) = O(1/\sqrt{n})$ [Nemirovski et al., 2009]. Experiments with these algorithms are shown in Section 5. Finally, we note that algorithms like PEGASOS or SGD could also be used to scan the data multiple times. In such cases, these algorithms would not be useful incremental algorithms, as it is not clear how one should add a new data point without a major retraining over the previous points. Currently, our method does not apply to such cases in a straightforward way.
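To make the single-pass setting concrete, the following is a minimal sketch of a PEGASOS-style learner written as an incremental update. It is a simplified variant (one example per step, no projection step), the function name and the default value of `lam` are illustrative assumptions, and the model is "padded" with the step counter `t` that determines the learning rate.

```python
from typing import Optional, Sequence, Tuple

import numpy as np

Example = Tuple[np.ndarray, int]      # (feature vector, label in {+1, -1})
PegasosModel = Tuple[np.ndarray, int]  # (weight vector w, number of steps t)


def pegasos_update(
    model: Optional[PegasosModel],
    chunk: Sequence[Example],
    lam: float = 0.01,  # illustrative regularization value, not from the paper
) -> PegasosModel:
    """Single pass over `chunk`, continuing from `model` (None = empty model).

    The first call must receive a non-empty chunk so the dimension is known.
    """
    if model is None:
        w = np.zeros_like(np.asarray(chunk[0][0], dtype=float))
        t = 0
    else:
        w, t = model[0].copy(), model[1]  # copy: leave the input model intact
    for x, y in chunk:
        t += 1
        eta = 1.0 / (lam * t)             # step size 1 / (lambda * t)
        margin = y * float(np.dot(w, x))  # margin under the current weights
        w *= 1.0 - eta * lam              # shrinkage from the regularizer
        if margin < 1.0:                  # hinge loss is active
            w += eta * y * np.asarray(x, dtype=float)
    return (w, t)
```

Since the update only depends on the current weights and the step counter, feeding the same points in the same order as one chunk or as several consecutive chunks yields the same model; what changes under TreeCV is the order in which chunks are visited.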


4 Complexity Analysis 

In this section, we analyze the running time and storage re¬ 
quirements of TreeCV, and discuss some practical issues 
concerning its implementation, including parallelization. 

4.1 Memory Requirements 

Efficient storage of and updates to the model are crucial for the efficiency of Algorithm 1. Indeed, in any call of TreeCV$(s, e, \hat f_{s..e})$ that does not correspond to simply evaluating a model on a chunk of data (i.e., $s \ne e$), TreeCV has to update the original model $\hat f_{s..e}$ twice, once with $Z_s, \ldots, Z_m$, and once with $Z_{m+1}, \ldots, Z_e$. To do this, TreeCV can either store $\hat f_{s..e}$, or revert to $\hat f_{s..e}$ from $\hat f_{s..m}$. In general, for any type of model, if the model for $\hat f_{s..e}$ is modified in-place, then we need to create a copy of it before it is updated to the model for $\hat f_{s..m}$, or, alternatively, keep track of the changes made to the model during the update. Whether to use the copying or the save/revert strategy depends on the application and the learning algorithm. For example, if the model state is compact, copying is a useful strategy, whereas when the model undergoes few changes during an update, save/revert might be preferred.
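The save/revert alternative to copying can be sketched with a small undo log. The class below is a hypothetical illustration (its name and interface are not from the paper): it records the old value of each entry the first time that entry is overwritten, so that an in-place update touching few entries can be rolled back at a cost proportional to the number of changed entries rather than to the model size.

```python
from typing import Dict, List


class UndoLog:
    """Tracks in-place writes to a weight list so they can be reverted."""

    def __init__(self, weights: List[float]) -> None:
        self.weights = weights
        self.saved: Dict[int, float] = {}

    def set(self, index: int, value: float) -> None:
        # Remember the original value only on the first write to this entry.
        if index not in self.saved:
            self.saved[index] = self.weights[index]
        self.weights[index] = value

    def revert(self) -> None:
        # Restore every touched entry, then forget the log.
        for index, old_value in self.saved.items():
            self.weights[index] = old_value
        self.saved.clear()


weights = [0.0] * 5
snapshot = list(weights)       # copying strategy: cost grows with model size

log = UndoLog(weights)         # save/revert strategy: cost grows with #writes
log.set(2, 1.5)
log.set(2, 3.0)
log.set(4, -1.0)
log.revert()
assert weights == snapshot
```

In a TreeCV call, the first branch would perform its updates through such a log and revert before the second branch starts, instead of holding a full copy of the model.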

Compared to a single run of the learning algorithm $\mathcal{L}$, TreeCV requires some extra storage for saving and restoring the models it trains along the way. When no parallelization is used in implementing TreeCV, we are in exactly one branch at every point during the execution of the algorithm. Since the largest height of a recursion branch is $O(\log k)$, and one model (or the changes made to it) is saved in each level of the branch, the total storage required by TreeCV is $O(\log k)$ times the storage needed for a single model.

TreeCV can be easily parallelized by dedicating one thread of computation to each of the data groups used in updating $\hat f_{s..e}$ in one call of TreeCV. In this case one typically needs to copy the model, since the two threads need to be able to run independently of each other; thus, the total number of models TreeCV needs to store is $O(k)$, since there are $2k - 1$ total nodes in the recursive call tree, with exactly one model stored per node. Note that a standard parallelized CV calculation also needs to store $O(k)$ models.

Finally, note that TreeCV is potentially useful in a distributed environment, where each chunk of the data is stored on a different node in the network. Updating the model on a given chunk can then be relegated to that computing node (the model is sent to the processing node, trained and sent back, i.e., this is not using all the nodes at once), and it is only the model (or the updates made to the model), not the data, that needs to be communicated to the other nodes. Since at every level of the tree, each chunk is added to exactly one model, the total communication cost of doing this is $O(k \log(k))$.

4.2 Running Time

Next, we analyze the time complexity of TreeCV when calculating the k-CV score for a dataset of size $n$ under our previous simplifying assumption that $n = bk$ for some integer $b \ge 1$.

The running time of TreeCV is analyzed in terms of the running time of the learning algorithm $\mathcal{L}$ and the time it takes to copy the models (or to save and then revert the changes made to it while it is being updated by $\mathcal{L}$). Throughout this subsection, we use the following definitions and notations: for $m = 0, 1, \ldots, n$, $l = 1, \ldots, n - m$, and $j = 1, \ldots, k$,

• $t_u(m, l) \ge 0$ denotes the time required to update a model, already trained on $m$ data points, with a set of $l$ additional data points;

• $t_s(m, l) \ge 0$ is the time required to copy the model (or save and revert the changes made to it) when the model is already trained on $m$ data points and is being updated with $l$ more data points;

• $t(j)$ is the time spent in saving, restoring, and updating models in a call to TreeCV$(s, e, \hat f_{s..e})$ with $j = e - s + 1$ hold-out chunks (and with $\hat f_{s..e}$ trained on $k - j$ chunks);

• $t_\ell$ denotes the time required to test a model on one of the $k$ chunks (where the model is trained on the other $k - 1$ chunks);

• $T(j)$ denotes the total running time of TreeCV$(s, e, \hat f_{s..e})$ when the number of chunks held out is $j = e - s + 1$, and $\hat f_{s..e}$ is already trained with $n - bj$ data points. Note that $T(k)$ is the total running time of TreeCV to calculate the k-CV score for a dataset of size $n$.

By definition, for all $j = 2, \ldots, k$, we have

$$t(j) = t_u(n - bj, b\lfloor j/2 \rfloor) + t_s(n - bj, b\lfloor j/2 \rfloor) + t_u(n - bj, b\lceil j/2 \rceil) + t_s(n - bj, b\lceil j/2 \rceil) + t_c,$$

where $t_c \ge 0$ accounts for the cost of the operations other than the recursive function calls.

We will analyze the running time of TreeCV under the following natural assumptions: First, we assume that $\mathcal{L}$ is not slower if data points are provided in batch rather than one by one.^6 That is,

$$t_u(m, l) \le \sum_{i=m}^{m+l-1} t_u(i, 1), \qquad (1)$$

for all $m = 0, \ldots, n$ and $l = 1, \ldots, n - m$. Second, we assume that updating a model requires work comparable to saving it or reverting the changes made to it during the update. This is a natural assumption since the update procedure is also writing those changes. Formally, we assume that there is a constant $c \ge 0$ (typically $c \le 1$) such that for all $m = 0, \ldots, n$ and $l = 1, \ldots, n - m$,

$$t_s(m, l) \le c\, t_u(m, l). \qquad (2)$$

To get a quick estimate of the running time, assume for a moment the idealized case that $k = 2^d$, $t_u(m, l) = l\, t_u(0, 1)$ for all $m$ and $l$, and $t_c = 0$. Since $n 2^{-j}$ data points are added to the models of a node at level $j$ in the recursive call tree, the work required in such a node is $(1 + c) n 2^{-j} t_u(0, 1)$. There are $2^j$ such nodes, hence the cumulative running time at level $j$ nodes is $(1 + c) n t_u(0, 1)$, hence the total running time of the algorithm is $(1 + c) n t_u(0, 1) \log_2 k$, where $\log_2$ denotes base-2 logarithm.

The next theorem establishes a similar logarithmic penalty (compared to the running time of feeding the algorithm with one data point at a time) in the general case.

Theorem 3. Assume (1) and (2) are satisfied. Then the total running time of TreeCV can be bounded as

$$T(k) \le n(1 + c)\, t_u^* \log_2(2k) + (k - 1)\, t_c + k\, t_\ell,$$

where $t_u^* = \max_{0 \le i \le n-1} t_u(i, 1)$.

Proof. By (1), $t_u(n - bj, l) \le \sum_{i=0}^{l-1} t_u(n - bj + i, 1) \le l\, t_u^*$ for all $l = 1, \ldots, bj$. Combining with (2), for any $2 \le j \le k$ we obtain

$$\begin{aligned}
t(j) &\le (1 + c)\, t_u(n - bj, b\lfloor j/2 \rfloor) + (1 + c)\, t_u(n - bj, b\lceil j/2 \rceil) + t_c \\
&\le (1 + c)\, b\, t_u^* \left(\lfloor j/2 \rfloor + \lceil j/2 \rceil\right) + t_c \\
&= j\, \frac{(1 + c)\, n}{k}\, t_u^* + t_c =: a j + t_c, \qquad (3)
\end{aligned}$$

where $a = (1 + c)\, n\, t_u^* / k$. Next we show by induction that for $j \ge 2$ this implies

$$T(j) \le a j \left(\log_2(j - 1) + 1\right) + (j - 1)\, t_c + j\, t_\ell. \qquad (4)$$

Substituting $j = k$ in (4) proves the theorem since $\log_2(j - 1) + 1 \le \log_2(2j)$. By the definition of TreeCV,

$$T(j) = \begin{cases} T(\lfloor j/2 \rfloor) + T(\lceil j/2 \rceil) + t(j), & j \ge 2; \\ t_\ell, & j = 1. \end{cases}$$

This implies that (4) holds for $j = 2, 3$. Assuming (4) holds for all $2 \le j' < j$, $4 \le j \le k$, from (3) we get

$$\begin{aligned}
T(j) &= T(\lfloor j/2 \rfloor) + T(\lceil j/2 \rceil) + t(j) \\
&\le a j \left(\log_2(\lceil j/2 \rceil - 1) + 2\right) + t_c (j - 1) + j\, t_\ell \\
&\le a j \left(\log_2(j - 1) + 1\right) + t_c (j - 1) + j\, t_\ell,
\end{aligned}$$

completing the proof of (4). □

For fully incremental, linear-time learning algorithms (such as PEGASOS or single-pass SGD), we obtain the following upper bound:

Corollary 4. Suppose that the learning algorithm $\mathcal{L}$ satisfies (2) and $t_u(0, m) = m\, t_u^*$ for some $t_u^* > 0$ and all $1 \le m \le n$. Then

$$T(k) \le (1 + c)\, T_{\mathcal{L}} \log_2(2k) + t_c (k - 1) + k\, t_\ell,$$

where $T_{\mathcal{L}} = t_u(0, n)$ is the running time of a single run of $\mathcal{L}$.

^6 If this is not the case, we would always input the data one by one even if there are more data points available.
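The idealized estimate of this section (k a power of two, unit cost per data point, no copying or bookkeeping cost) can be checked by simply counting how many data points are passed to the learner. The two helper functions below are hypothetical names introduced for this sketch; they count update work only and ignore testing time.

```python
def treecv_points_fed(k: int, b: int) -> int:
    """Data points passed to the learner by TreeCV with k chunks of size b."""

    def node(j: int) -> int:
        # A call with j hold-out chunks feeds floor(j/2) chunks to one copy
        # of the model and ceil(j/2) chunks to the other, then recurses.
        if j == 1:
            return 0
        small, large = j // 2, j - j // 2
        return b * (small + large) + node(small) + node(large)

    return node(k)


def naive_points_fed(k: int, b: int) -> int:
    """Data points passed to the learner by k independent training runs."""
    return k * (k - 1) * b


# k = 8 chunks of one point each (LOOCV on n = 8 points):
print(treecv_points_fed(8, 1))  # 24 = n * log2(k)
print(naive_points_fed(8, 1))   # 56 = n * (k - 1)
```

For k that is not a power of two the count is no longer exactly n log2(k), but it stays below the n log2(2k) factor appearing in Theorem 3; for example, k = 3 and b = 1 gives 5 points against a bound of about 7.75.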


5 Experiments 

In this section we evaluate TreeCV and compare it with the standard (k-repetition) CV calculation. We consider two incremental algorithms: linear PEGASOS [Shalev-Shwartz et al., 2011] for SVM classification, and least-square stochastic gradient descent (LSQSGD) for linear least-squares regression (more precisely, LSQSGD is the robust stochastic approximation algorithm of Nemirovski et al. [2009] for the squared loss and parameter vectors constrained in the unit $\ell_2$-ball). Following the suggestions in the original papers, we take the last hypothesis from PEGASOS and the average hypothesis from LSQSGD as our model. We focus on the large-data regime in which the algorithms learn from the data in a single pass.

The algorithms were implemented in Python/Cython and NumPy. The tests were run on a single core of a computer with an Intel Xeon E5430 processor and 20 GB of RAM. We used datasets from the UCI repository [Lichman, 2013], downloaded from the LibSVM website [Chang and Lin, 2011].


Figure 2: Running time of TreeCV and standard k-CV for different values of k as a function of the number of data points n,
averaged over 100 independent repetitions. Top row: PEGASOS; bottom row: least-square SGD. Left column: k-CV without
permutations; middle column: k-CV with data permutation; right column: LOOCV with and without permutations.

We tested PEGASOS on the UCI Covertype dataset
(581,012 data points, 54 features, 7 classes), learning class
“1” against the rest of the classes. The features were scaled
to have unit variance. The regularization parameter was set to
$\lambda = 10^{-6}$ following the suggestion of Shalev-Shwartz et al.
[2011]. For LSQSGD, we used the UCI YearPredictionMSD
dataset (463,715 data points, 90 features) and set the step-size
$\alpha$ following the suggestion of Nemirovski et al. [2009].
The target values were scaled to [0,1].
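
The preprocessing described above can be sketched as follows. This is an illustration only: the helper names are hypothetical, and it assumes that features are divided by their per-column standard deviation without centering and that targets are rescaled with a min-max transformation, neither of which is spelled out in the text.

```python
import math


def one_vs_rest(labels, positive):
    """Map multi-class labels to +1 for `positive` and -1 for every other class."""
    return [1 if label == positive else -1 for label in labels]


def scale_to_unit_variance(rows):
    """Divide each feature column by its standard deviation.

    Constant columns (zero variance) are left unchanged.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(d)
    ]
    return [
        [r[j] / stds[j] if stds[j] > 0 else r[j] for j in range(d)]
        for r in rows
    ]


def scale_to_unit_interval(targets):
    """Min-max rescaling of the regression targets to [0, 1]."""
    lo, hi = min(targets), max(targets)
    if hi == lo:
        return [0.0 for _ in targets]
    return [(y - lo) / (hi - lo) for y in targets]
```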

Naturally, PEGASOS and LSQSGD are sensitive to the order
in which data points are provided (although they are
incrementally stable, as mentioned earlier). In a vanilla
implementation, the order of the data points is fixed in
advance for the whole CV computation. That is, there is a fixed
ordering of the chunks and of the samples within each chunk,
and if we need to train a model with chunks $Z_{i_1}, \ldots, Z_{i_j}$,
the data points are given to the training algorithm according
to this hierarchical ordering. This introduces a certain
dependence in the CV estimation procedure: for example, the
model trained on chunks $Z_1, \ldots, Z_{k-1}$ has visited the data in
a very similar order to the one trained on $Z_1, \ldots, Z_{k-2}, Z_k$
(except for the last n/k steps of the training). To eliminate
this dependence, we also implemented a randomized version
in which the samples used in a training phase are provided
in a random order (that is, we take all the data points for the
chunks $Z_{i_1}, \ldots, Z_{i_j}$ to be used, and feed them to the training
algorithm in a random order).
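
The difference between the fixed and the randomized variant can be illustrated with a short sketch. The function name and its arguments are hypothetical; the shuffle is written out as a Fisher-Yates pass so that the single linear sweep over the training points is visible.

```python
import random


def training_sequence(chunks, chunk_ids, rng=None):
    """Assemble the points of the selected chunks in the order fed to the learner.

    chunks    -- list of chunks, each a list of data points
    chunk_ids -- indices of the chunks to train on
    rng       -- a random.Random instance for the randomized variant, or None
                 for the fixed hierarchical order (chunk order, then the
                 order of the points within each chunk)
    """
    sequence = [point for i in sorted(chunk_ids) for point in chunks[i]]
    if rng is not None:
        # Fisher-Yates shuffle: one uniform draw and one swap per point.
        for i in range(len(sequence) - 1, 0, -1):
            j = rng.randint(0, i)          # uniform on {0, ..., i}, inclusive
            sequence[i], sequence[j] = sequence[j], sequence[i]
    return sequence
```

With four chunks, the fixed sequences for the chunk sets {0, 1, 2} and {0, 1, 3} share everything except the points of the last chunk, which is the dependence the randomized variant is meant to remove.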


Table 2 shows the values of the CV estimates computed
under different scenarios. It can be observed that the standard
(k-repetition) CV method is quite sensitive to the order of the
points: the variance of the estimate does not decay noticeably as
the number of folds k increases, while we see the expected
decay for the randomized version. On the other hand, the
non-randomized version of TreeCV does not show such
behavior, as the automatic re-permutation that happens during
TreeCV might have made the k folds less correlated.


CV estimates for PEGASOS (misclassification rate × 100)

           TreeCV                              Standard
           fixed             randomized        fixed             randomized
k = 5      30.682 ± 1.2127   30.839 ± 0.9899   30.825 ± 1.9248   30.768 ± 1.1243
k = 10     30.665 ± 0.8299   30.554 ± 0.7125   30.767 ± 1.7754   30.541 ± 0.7993
k = 100    30.677 ± 0.3040   30.634 ± 0.2104   30.636 ± 2.0019   30.624 ± 0.2337
k = n      30.640 ± 0.0564   30.637 ± 0.0592   N/A               N/A


CV estimates for LSQSGD (squared error × 100)

           TreeCV                              Standard
           fixed             randomized        fixed             randomized
k = 5      25.299 ± 0.0019   25.298 ± 0.0018   25.299 ± 0.0019   25.299 ± 0.0017
k = 10     25.297 ± 0.0016   25.297 ± 0.0015   25.297 ± 0.0016   25.297 ± 0.0016
k = 100    25.296 ± 0.0012   25.296 ± 0.0013   25.296 ± 0.0011   25.296 ± 0.0013
k = n      25.296 ± 0.0012   25.296 ± 0.0012   N/A               N/A


Table 2: k-CV performance estimates averaged over 100
repetitions (and their standard deviations), for the full datasets
with and without data repermutation: PEGASOS (top) and
LSQSGD (bottom).


However, randomizing the order of the training points
typically reduces the variance of the TreeCV estimate as well.

Figure 2 shows the running times of TreeCV and the standard
CV method, as a function of n, for PEGASOS (top row)
and LSQSGD (bottom row). The first two columns show the
running times for different values of k, with and without
randomizing the order of the data points (middle and left column,
resp.), while the rightmost column shows the running time
(log-scale) for LOOCV calculations. TreeCV outperforms
the standard method in all of the cases. It is notable that
TreeCV makes the calculation of LOOCV practical even
for n = 581,012, in a fraction of the time required by the
standard method at n = 10,000; for example, for PEGASOS,
TreeCV takes around 20 seconds (46 when randomized)
for computing LOOCV at n = 581,012, while the standard
method takes around 124 seconds (175 when randomized) at
n = 10,000. Furthermore, one can see that the variance
reduction achieved by randomizing the data points comes at the
price of a constant-factor increase in running time (the factor is
around 1.5 for the standard method, and 2 for TreeCV). This
comes from the fact that both the training time and the time
of generating a random permutation are linear in the number
of points (assuming generating a random number uniformly
from {1, ..., n} can be done in constant time).
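
The gap in running time can be reproduced in miniature by counting how many data points each procedure feeds to the learner. The sketch below is an assumption-laden illustration rather than the implementation used in the experiments: it assumes that TreeCV recursively halves the held-out range of chunks and reuses the partially trained model for both halves, and every name in it (`standard_cv`, `tree_cv`, the toy running-mean learner) is hypothetical.

```python
import copy


def standard_cv(chunks, init, update, loss):
    """k-repetition CV: train a fresh model for each held-out chunk."""
    k = len(chunks)
    total = 0.0
    for held_out in range(k):
        model = init()
        for j in range(k):
            if j != held_out:
                model = update(model, chunks[j])
        total += loss(model, chunks[held_out])
    return total / k


def tree_cv(chunks, init, update, loss):
    """Tree-structured CV sketch: share training work between folds."""
    k = len(chunks)

    def recurse(s, e, model):
        # Invariant: `model` was trained on every chunk except chunks[s..e].
        if s == e:
            return loss(model, chunks[s])
        m = (s + e) // 2
        left = copy.deepcopy(model)            # save a copy before branching
        for j in range(m + 1, e + 1):          # add right half, hold out left
            left = update(left, chunks[j])
        total = recurse(s, m, left)
        for j in range(s, m + 1):              # add left half, hold out right
            model = update(model, chunks[j])
        return total + recurse(m + 1, e, model)

    return recurse(0, k - 1, init()) / k


# Toy order-insensitive incremental learner: predicts the running mean.
# Requires at least two chunks so that every model has seen some data.
def mean_init():
    return (0.0, 0)


def mean_update(model, chunk):
    return (model[0] + sum(chunk), model[1] + len(chunk))


def mean_loss(model, chunk):
    prediction = model[0] / model[1]
    return sum((y - prediction) ** 2 for y in chunk) / len(chunk)
```

For 16 points split into k = 8 chunks, wrapping `mean_update` in a counter shows that `standard_cv` processes n(k - 1) = 112 points while `tree_cv` processes 48, and both return the same estimate because the toy learner does not depend on the order of the data.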

6 Conclusion 

We presented a general method, TreeCV, to speed up
cross-validation for incremental learning algorithms. The method
is applicable to a wide range of supervised and unsupervised
learning settings. We showed that, under mild conditions
on the incremental learning algorithm being used, TreeCV
computes an accurate approximation of the k-CV estimate,
and its running time scales logarithmically in k (the number
of CV folds), while the running time of the standard method
of training k separate models scales linearly with k.

Experiments on classification and regression, using two
well-known incremental learning algorithms, PEGASOS and
least-square SGD, confirmed the speedup and predicted
accuracy. When the model learned by the learning algorithm
depends on whether the data is provided incrementally or in
batch (or on the order of the data, as in the case of online
algorithms), the CV estimate calculated by our method was still
close to the CV computed by the standard method, but with a
lower variance.